Y binned at each clinical score and MG at species level
Overview:
DIABLO Multi-Omics Integration
1. Load already QC and normalized data
   Clean, normalize, and scale each omics dataset independently.
2. Select taxa level and combine X
   Use PLS-DA to confirm that class separation exists before applying sparsity.
3. Identify sample sets
   Identify the optimal number of components (ncomp) and variables to select per component (keepX) via cross-validation.
4. Select response variable
   Build the sPLS-DA model using the tuned parameters.
5. Specify design matrix
   Evaluate classification accuracy, balanced error rate (BER), and stability of selected variables.
6. Model
   Retrieve the most discriminant variables identified by the sPLS-DA model.
7. Visualize models
   Use the selected features as input for integration frameworks such as DIABLO (block.splsda()).
Parameter tuning
There are several variables that we can tweak to optimise our integration model. These include:
- weights
- number of blocks
- species or strain level of MG
- number of features to select for tuning
I want to test a couple of combinations to see if there are any significant changes.
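One way to organise such a comparison is to enumerate the combinations up front. The sketch below uses base R's expand.grid(); the column names and every value in it (design weights of 0.1 and 1, two or three blocks, the keepX candidates) are illustrative assumptions, not the settings used in this analysis.

```r
# Hypothetical tuning grid: all names and values below are illustrative assumptions
tuning_grid <- expand.grid(
  design_weight = c(0.1, 1),               # between-block value of the design matrix
  n_blocks      = c(2, 3),                 # e.g. two blocks, or TR + MG + MB
  mg_level      = c("species", "strain"),  # taxonomic level of the MG block
  keepX         = c(10, 25, 50),           # features to select per component
  stringsAsFactors = FALSE
)

nrow(tuning_grid)   # number of model configurations to fit
head(tuning_grid)
```

Each row of the grid describes one model to fit, which makes it straightforward to loop over configurations and collect the error rates in the same table.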
X <- X %>%
  # to add row names
  map_at(c(1, 3), ~data.frame(.x, row.names = 1)) %>%
  map_at(c(1, 3), ~t(.x)) %>%
  # convert to matrix
  map(., ~as.matrix(.x))
3. Identify sample sets
Get group assignment
# map each clinical score (0-5) to a binned group
g <- c("0" = "0-1", "1" = "0-1", "2" = "2-3", "3" = "2-3", "4" = "4-5", "5" = "4-5")

gr_data <- meta_data %>%
  # the original column name carries a note in Swedish: "sometimes they answered
  # differently in writing in the questionnaire and orally at inclusion; in
  # those cases I chose the highest score"
  rename(`Symptom score (0-5)` = "Symptom score (0-5) - ibland har de svarat olika skriftligt i frågeformuläret och muntligt vid inklusiion, då har jag valt den högsta scoren") %>%
  select(
    svamp_ID,
    `Clinical score (0-5)`,
    `Fungal culture: C. albicans (y/n)`,
    `Fungal culture: Non-albicans candida spp. (y/n)`,
    `Symptom score (0-5)`,
    `Recurring fungal infections > 2/year (y/n)`
  ) %>%
  # first matching condition wins; anything unmatched becomes NA
  mutate(group = case_when(
    `Fungal culture: C. albicans (y/n)` == "1" & `Symptom score (0-5)` >= 1 & `Recurring fungal infections > 2/year (y/n)` == "1" ~ "RVVCpos",
    `Fungal culture: C. albicans (y/n)` == "0" & `Recurring fungal infections > 2/year (y/n)` == "1" ~ "RVVCneg",
    `Fungal culture: C. albicans (y/n)` == "1" & `Symptom score (0-5)` == 0 ~ "AS",
    `Fungal culture: C. albicans (y/n)` == "0" & `Recurring fungal infections > 2/year (y/n)` == "0" ~ "Control",
    `Fungal culture: C. albicans (y/n)` == "1" & `Recurring fungal infections > 2/year (y/n)` == "0" ~ "Candidapos",
    TRUE ~ NA_character_
  )) %>%
  # positive for any Candida culture (C. albicans or non-albicans)
  mutate(pos = case_when(
    `Fungal culture: C. albicans (y/n)` == "1" | `Fungal culture: Non-albicans candida spp. (y/n)` == 1 ~ "pos",
    TRUE ~ "neg"
  ), .after = "group") %>%
  # binned clinical score, looked up in g
  mutate(Clin_gr = g[as.character(.$`Clinical score (0-5)`)], .after = "group") %>%
  select("svamp_ID", group, everything())
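The binning of the clinical score relies on indexing a named vector by character keys. The minimal sketch below shows that lookup in isolation; the vector g is the one defined in the chunk above, while the scores are made up for illustration.

```r
# Named lookup vector: clinical score (as character) -> binned group
g <- c("0" = "0-1", "1" = "0-1", "2" = "2-3", "3" = "2-3", "4" = "4-5", "5" = "4-5")

# Toy scores (made up for illustration); a missing score stays NA after the lookup
scores <- c(0, 3, 5, 1, NA)
binned <- unname(g[as.character(scores)])

binned
table(binned, useNA = "ifany")
```

Because missing scores map to NA, samples without a clinical score can be dropped afterwards with a simple is.na() filter on the binned column.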
Note: S15 and S10 were filtered out of the luminal metagenomics data, one due to a low total count and the other because less than 60% of its taxa remained after filtering.
# all sample IDs across data sets (union)
ids <- map(X, ~rownames(.x)) %>% unlist() %>% unique()

# find intersection of all data sets
d <- map(X, ~rownames(.x))
ids <- intersect(intersect(d$TR, d$MG), d$MB)

gr <- tibble(ID = ids) %>%
  tibble(ID_ = str_replace(ID, "_run2", "")) %>%
  left_join(., gr_data, by = c("ID_" = "svamp_ID")) %>%
  filter(!(is.na(.$Clin_gr)))

table(gr$Clin_gr)
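If the number of blocks changes during tuning, the nested intersect() calls have to be rewritten each time. A possible alternative, sketched here in base R with made-up sample IDs, is to fold intersect() over the whole list with Reduce(); the block names TR, MG and MB follow the ones used for X.

```r
# Toy sample IDs per block (made up for illustration)
d <- list(
  TR = c("S01", "S02", "S03", "S04"),
  MG = c("S01", "S02", "S04"),
  MB = c("S02", "S03", "S04", "S05")
)

# Intersection across any number of blocks
ids <- Reduce(intersect, d)
ids

# Samples dropped from each block because they are missing elsewhere
dropped <- lapply(d, setdiff, ids)
dropped
```

Listing the dropped samples per block makes it easy to see which data set limits the shared sample set.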
“Full weighted design matrix”: the design matrix parameters (the weights linking each pair of omics datasets) are set to 0.1. This maximizes the separation between sample groups while still taking the correlation between omics datasets into account.
“Full design matrix”: the design matrix parameters are set to 1 instead. This maximizes the correlation between omics datasets, prioritizing the association between features of the metabolome and features of the transcriptome. Thus, although integrating data with DIABLO does not always improve sample classification compared with the most predictive omics dataset alone, DIABLO remains useful as a method for associating features of the metabolome and transcriptome that can discriminate sample groups [1].
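As a sketch, both designs can be built in base R as square matrices with one row and one column per block. The block names TR, MG and MB follow the names used for X in this document; make_design() is a hypothetical helper introduced here, and the zero diagonal follows the usual mixOmics convention.

```r
# Hypothetical helper: square design matrix with a constant between-block weight
make_design <- function(blocks, weight) {
  design <- matrix(
    weight,
    nrow = length(blocks), ncol = length(blocks),
    dimnames = list(blocks, blocks)
  )
  diag(design) <- 0   # a block is not linked to itself
  design
}

blocks <- c("TR", "MG", "MB")

design_weighted <- make_design(blocks, 0.1)  # favours group discrimination
design_full     <- make_design(blocks, 1)    # favours between-block correlation

design_weighted
design_full
```

Either matrix can then be passed as the design argument of the block model, so switching between the two settings is a one-line change.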
The model fitted below with block.plsda() is a multi-omics PLS-DA that does not perform feature selection.
Filter pre-selected variables
# filter out selected variables
f_select <- list(TR = TR_select, MG = MG_select, MB = MB_select)

# remove some transcripts: drop PC 1 features with value < 0.8
f_select$TR <- f_select$TR %>% filter(!(PC == 1 & value < 0.8))

# remove any duplicate features selected for more than one PC
f_select <- f_select %>%
  map(~ .x %>%
    group_by(name) %>%
    slice_max(value, n = 1, with_ties = FALSE) %>%
    ungroup()
  )

X <- X %>%
  # filter selected features
  map2(., f_select, ~.x[, .y$name])

map(X, ~dim(.x))
Fit model
# check
# 1. names are identical and in order for list.keepX and X
# 2. check that there are no duplicate features in any of the tables in X
# 4. features should be columns and samples rows
names(X)
length(Y)
map(X, ~anyDuplicated(rownames(.x)))  # duplicated samples (rows)
map(X, ~anyDuplicated(colnames(.x)))  # duplicated features (columns)

# discrete response variable:
diablo.mod <- block.plsda(X, Y, ncomp = 2, design = design)
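The manual checks listed above can also be turned into assertions that stop the script when a block is malformed. The following is a minimal base R sketch on made-up data; check_blocks(), X_toy and Y_toy are hypothetical names introduced for illustration only.

```r
# Toy blocks and response (made up); samples are rows, features are columns
set.seed(1)
samples <- paste0("S", 1:6)
X_toy <- list(
  TR = matrix(rnorm(18), nrow = 6, dimnames = list(samples, paste0("gene", 1:3))),
  MB = matrix(rnorm(12), nrow = 6, dimnames = list(samples, paste0("met", 1:2)))
)
Y_toy <- factor(rep(c("0-1", "2-3"), each = 3))

# Hypothetical helper: fails with an error if any check is violated
check_blocks <- function(X, Y) {
  row_ids <- lapply(X, rownames)
  stopifnot(
    # same samples, in the same order, in every block
    all(vapply(row_ids, identical, logical(1), row_ids[[1]])),
    # one response value per sample
    length(Y) == nrow(X[[1]]),
    # no duplicated sample names or feature names
    all(vapply(X, function(x) anyDuplicated(rownames(x)) == 0, logical(1))),
    all(vapply(X, function(x) anyDuplicated(colnames(x)) == 0, logical(1)))
  )
  invisible(TRUE)
}

check_blocks(X_toy, Y_toy)
```

Running the helper immediately before fitting means a mismatch in sample order is caught as an error rather than silently producing a misaligned model.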
DIABLO is a multi-omics sparse PLS-DA. The “sparse” in the name indicates that it includes feature selection.
# diablo.plsda <- block.plsda(X, Y, ncomp = 5, design = design)
# set.seed(123) # For reproducibility, remove for your analyses
# perf.diablo = perf(diablo.plsda, validation = 'Mfold', folds = 3, nrepeat = 10)
# #perf.diablo.tcga$error.rate # Lists the different types of error rates
# # Plot of the error rates based on weighted vote
# plot(perf.diablo)
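Among the error rates reported by such a performance assessment, the balanced error rate (BER) is the most relevant one when group sizes differ, as they may for the binned clinical scores. The base R sketch below shows how BER is computed from true and predicted labels; balanced_error_rate() is a hypothetical helper and the labels are made up, so this illustrates the metric rather than the mixOmics internals.

```r
# Balanced error rate: mean of the per-class error rates
balanced_error_rate <- function(truth, predicted) {
  truth <- factor(truth)
  predicted <- factor(predicted, levels = levels(truth))
  confusion <- table(truth, predicted)                 # rows: truth, columns: prediction
  per_class_error <- 1 - diag(confusion) / rowSums(confusion)
  mean(per_class_error)
}

# Toy labels (made up): an unbalanced two-class problem
truth     <- c(rep("0-1", 8), rep("2-3", 2))
predicted <- c(rep("0-1", 8), "0-1", "2-3")

mean(truth != predicted)                # overall error rate
balanced_error_rate(truth, predicted)   # BER weights both classes equally
```

In this toy case one of the two samples in the small class is misclassified, so the BER is much higher than the overall error rate, which is dominated by the large class.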